Assessing the Factors That Affect Red Wine Quality

========================================================

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The dataset contains 1,599 records, and the quality scores range from 3 to 8. There are no wines with a very high score (10) or a very low score (1).
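To see which scores are absent, one option is to tabulate quality against the full scale. The sketch below is only an illustration: it uses the six quality values printed in the `head()` output above rather than the full dataset, and it assumes a 0-10 scale as used in the axis label of the final plots. The names `quality_head`, `quality_counts` and `missing_scores` are made up for this example.

```r
# First six values of the quality column, copied from the head() output above.
quality_head <- c(5, 5, 5, 6, 5, 5)

# Fix the factor levels to the assumed 0-10 scale so that unused scores
# appear in the table with a count of 0 instead of being dropped.
quality_counts <- table(factor(quality_head, levels = 0:10))
print(quality_counts)

# Scores that never occur in this sample.
missing_scores <- names(quality_counts)[quality_counts == 0]
print(missing_scores)
```

On the full data frame the same idea would be applied to `red_wine$quality`.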

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.274   2.500   5.000

From the plot, most residual sugar values fall between 1.5 and 2.5 g/dm^3.
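The second summary above has a much lower maximum than the first, which suggests the top tail was trimmed before summarising. A minimal sketch of one way to do that follows, assuming a cut at the 95th percentile (the same cut the chlorides code in this report uses). The vector `sugar` is a toy sample: the six residual sugar values from the `head()` output plus the maximum of 15.5 from the first summary. The names `sugar`, `cutoff` and `trimmed` are illustrative.

```r
# Toy sample: six head() values plus the reported maximum as an outlier.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 15.5)

# 95th percentile of the sample (R's default quantile type 7).
cutoff <- quantile(sugar, 0.95)

# Keep only the observations below the cutoff, then summarise them.
trimmed <- sugar[sugar < cutoff]
print(summary(trimmed))
```

Trimming this way only changes what is summarised or plotted; the original vector is left untouched.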

ggplot(red_wine, aes(alcohol)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
  geom_vline(xintercept = mean(red_wine$alcohol), color = 'coral')

summary(red_wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

ggplot(red_wine, aes(x = chlorides)) +
  geom_histogram() +
  xlim(quantile(red_wine$chlorides, 0.05), quantile(red_wine$chlorides, 0.95)) +
  xlab("chlorides (middle 95%)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 158 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(subset(red_wine$chlorides,
               red_wine$chlorides < quantile(red_wine$chlorides, 0.95)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07800 0.07914 0.08800 0.12600
ggplot(red_wine, aes(x=density)) +
  geom_density() +
  stat_function(linetype = 'dashed',
                color = 'royalblue',
                fun = dnorm,
                args = list(mean = mean(red_wine$density), sd = sd(red_wine$density)))

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1,599 records, each with 12 attributes.

What is/are the main feature(s) of interest in your dataset?

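The main feature of interest is quality: which factors drive the variation in red wine quality? Note that the scores in this dataset only range from 3 to 8, so there are no exceptionally good or exceptionally bad wines. The mean quality score is 5.636.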

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

So far only single variables have been examined, so it is not yet possible to tell which factors affect red wine quality.

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

round(cor(red_wine), 3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.256       0.672
## volatile.acidity            -0.256            1.000      -0.552
## citric.acid                  0.672           -0.552       1.000
## residual.sugar               0.115            0.002       0.144
## chlorides                    0.094            0.061       0.204
## free.sulfur.dioxide         -0.154           -0.011      -0.061
## total.sulfur.dioxide        -0.113            0.076       0.036
## density                      0.668            0.022       0.365
## pH                          -0.683            0.235      -0.542
## sulphates                    0.183           -0.261       0.313
## alcohol                     -0.062           -0.202       0.110
## quality                      0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                1.000
ggplot(red_wine, aes(x = alcohol, y = quality)) +
  geom_point()

ggplot(red_wine, aes(x = alcohol, y = quality)) +
  geom_jitter(alpha = 0.25) +
  geom_smooth(method = "lm")

From the plot, red wine quality has a moderate positive correlation with alcohol content (r = 0.476).
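A single coefficient like this can be computed directly with `cor()`, and `cor.test()` adds a confidence interval and p-value. The sketch below uses only the six rows shown in the `head()` output, so its result is not the 0.476 reported for the full data; it is meant to show the calls, not to reproduce the figure. The data frame name `wine_head` is made up for this example.

```r
# Six rows copied from the head() output at the top of the report.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation coefficient between the two columns.
r <- cor(wine_head$alcohol, wine_head$quality)
print(round(r, 3))

# Same estimate, plus a t-test and confidence interval for the correlation.
ct <- cor.test(wine_head$alcohol, wine_head$quality, method = "pearson")
print(ct$estimate)
print(ct$conf.int)
```

With so few rows the confidence interval is very wide; on the full dataset the call would be `cor.test(red_wine$alcohol, red_wine$quality)`.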

ggplot(red_wine, aes(x = residual.sugar, y = quality)) +
  xlim(0, quantile(red_wine$residual.sugar, 0.95)) +
  xlab("residual sugar (bottom 95%)") +
  geom_jitter(alpha = 0.15)
## Warning: Removed 81 rows containing missing values (geom_point).

ggplot(red_wine, aes(x = volatile.acidity, y = quality)) +
  geom_jitter(alpha = 0.25) +
  geom_smooth(method = 'lm')

ggplot(red_wine, aes(x = fixed.acidity, y = pH)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = 'lm')

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

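Of all the attributes, alcohol content has the highest correlation with the quality score (0.476). Quality also has a fairly strong negative correlation with volatile acidity (-0.391): the higher the volatile acidity, the lower the quality tends to be.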

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

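Density and alcohol content have a fairly strong negative correlation (-0.496). This surprised me, probably because I know next to nothing about what red wine is made of.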

What was the strongest relationship you found?

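The negative correlation between fixed acidity and pH (-0.683). This one is easy to guess: the higher the acidity, the lower the pH.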
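Rather than scanning the correlation table by eye, the strongest pair can be picked out programmatically. The following sketch works on a 4x4 subset of the matrix, with values copied from the `round(cor(red_wine), 3)` output above; the variable names `cor_sub`, `upper`, `idx` and `strongest` are illustrative, and the same steps should apply to the full matrix.

```r
# Subset of the correlation matrix, values copied from the output above.
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cor_sub <- matrix(
  c( 1.000,  0.672,  0.668, -0.683,
     0.672,  1.000,  0.365, -0.542,
     0.668,  0.365,  1.000, -0.342,
    -0.683, -0.542, -0.342,  1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Blank out the diagonal and lower triangle so each pair is seen once
# and the trivial self-correlations of 1 are ignored.
upper <- cor_sub
upper[lower.tri(upper, diag = TRUE)] <- NA

# Position of the entry with the largest absolute value.
idx <- which(abs(upper) == max(abs(upper), na.rm = TRUE), arr.ind = TRUE)

strongest <- data.frame(
  var1 = rownames(upper)[idx[1, "row"]],
  var2 = colnames(upper)[idx[1, "col"]],
  r    = upper[idx[1, "row"], idx[1, "col"]]
)
print(strongest)
```

Comparing absolute values matters here, because the strongest relationship in this table is a negative one.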

Multivariate Plots Section

ggplot(red_wine, aes(x = alcohol, y = quality, color = volatile.acidity)) +
  geom_jitter() +
  scale_color_gradient(high = 'blue', low = 'green')

ggplot(red_wine, aes(x = alcohol, y = quality, color = citric.acid)) +
  geom_jitter() +
  scale_color_gradient(high = 'green', low = 'blue')

ggplot(red_wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
  geom_jitter() +
  scale_color_brewer()

ggplot(red_wine, aes(x = alcohol, y = quality, color = citric.acid)) +
  geom_jitter() +
  scale_color_gradient(high = 'red', low = 'blue')

ggplot(red_wine, aes(x = alcohol, y = density, color = residual.sugar)) +
  geom_jitter() +
  scale_color_gradient2(limits=c(0, quantile(red_wine$residual.sugar, 0.95)),
                        midpoint = median(red_wine$residual.sugar))

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots show that as alcohol content rises and volatile acidity falls, red wine quality tends to increase.
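A numeric cross-check of this pattern is to compare group means per quality score. The sketch below is a toy illustration on the six `head()` rows only (the data frame name `wine_head` and the result name `by_quality` are made up); with so few rows it shows the mechanics of `aggregate()`, not evidence for the trend.

```r
# Six rows copied from the head() output at the top of the report.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Mean alcohol and mean volatile acidity for each quality score.
by_quality <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = wine_head,
  FUN  = mean
)
print(by_quality)
```

On the full dataset, replacing `wine_head` with `red_wine` would give one row per quality score from 3 to 8.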

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

ggplot(red_wine, aes(alcohol)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
  annotate('text',
           x = median(red_wine$alcohol) - 0.35,
           y = 120,
           label = paste('median\n(', median(red_wine$alcohol), ')', sep = ''),
           color = 'royalblue') +
  geom_vline(xintercept = mean(red_wine$alcohol), color = 'red') +
  annotate('text',
           x = mean(red_wine$alcohol) + 0.35,
           y = 120,
           label = paste('mean\n(', round(mean(red_wine$alcohol), 2), ')', sep = ''),
           color = 'red') +
  xlab("Alcohol (%)") +
  ylab("Count")

Description One

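Because the data show a correlation between alcohol content and quality, this plot looks at how the wines are distributed across alcohol levels. The median (10.2) is lower than the mean (10.42), which is consistent with a right-skewed distribution.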

Plot Two

ggplot(red_wine, aes(x = alcohol, y = quality)) +
  geom_jitter(alpha = 0.1, height = 0.48, width = 0.025) +
  geom_smooth(method = "lm") +
  ggtitle("Quality vs Alcohol Content") +
  xlab("Alcohol (%)") +
  ylab("Quality (0-10)")

Description Two

This plot shows the positive correlation between alcohol content and quality (r = 0.476).

Plot Three

ggplot(red_wine, aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
  geom_jitter() +
  scale_color_brewer(name = "Quality") +
  ggtitle("Quality by Volatile Acidity and Alcohol") +
  xlab("Alcohol (%)") +
  ylab("Volatile Acidity (g/L)")

Description Three

This plot shows that as red wine quality increases, alcohol content rises while volatile acidity falls.


Reflection

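  1. I first looked at the overall structure of the data: there are 1,599 records, each with 12 attributes related to wine quality.
  2. The goal is to find which components are related to red wine quality, but the sample contains no high-quality wines (scored above 8), which may affect the overall results.
  3. By examining pairwise correlations, I initially identified a moderate correlation between red wine quality and alcohol content.
  4. With one variable identified, I then looked for a second factor that affects red wine quality, and found that volatile acidity is negatively correlated with wine quality.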